IMPROVED METHOD FOR CONVERSION OF 
PHONETIC CHINESE TO CHARACTER CHINESE 
Field of the Invention 

This invention relates to automated methods for converting phonetic 
Chinese to character Chinese. 
Background of the Invention 

Because the Chinese language uses thousands of characters, in contrast 
to the English language's use of 26 characters, the development of modern Chinese 
word processing equipment is a substantial problem. Obviously, a typewriter 
keyboard consisting of thousands of keys is impractical. 

Phonetic (sometimes called phonemic) input schemes, based on the use 
of normal keyboards, are often used to input Chinese character text into a computer 
or word processor. These schemes, known as the Mandarin Phonetic system in 
Taiwan and the Hanyu Pinyin system in the People's Republic of China, involve the 
transliteration, for example, of the five following Chinese characters: 

[five Chinese characters; not legible in this copy] 

respectively into the five following single syllables reflecting the pronunciation of 
the Chinese characters: 

tai2 wan1 you3 tai2 feng1 

Because this phonemic input does not require special keyboards or the 
mastery of special coding schemes, its use is advantageous. However, since the 
number of pronounced Chinese syllables is significantly smaller than the number of 
Chinese characters, phonemic input suffers from the problem of ambiguity. A 
well-educated Chinese reader may be expected to know about 6000 characters, but the 
number of syllables is about 1200. Thus, one syllable may represent many different 
characters. For example, in the Hanyu Pinyin transliteration system, all the 
following characters are pronounced shi4: 

[Chinese characters; not legible in this copy] 

Not surprisingly, translation of the 5-syllable phrase 

tai2 wan1 you3 tai2 feng1 

into Chinese characters presents over 21,000 (i.e., 9x5x6x9x9) different 
combinations of characters, since "tai2", "wan1", "you3", "tai2", and "feng1" 
respectively represent at least 9, 5, 6, 9, and 9 different Chinese characters. 
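
The size of the search space quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the candidate counts are the ones stated in the text, while the dictionary and function names are invented here.

```python
from math import prod

# Candidate-character counts per syllable, as stated in the text above.
candidates_per_syllable = {"tai2": 9, "wan1": 5, "you3": 6, "feng1": 9}
phrase = ["tai2", "wan1", "you3", "tai2", "feng1"]


def count_combinations(syllables, counts):
    """Number of distinct character sequences the syllables could spell."""
    return prod(counts[s] for s in syllables)


total = count_combinations(phrase, candidates_per_syllable)
print(total)  # 21870
```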

As described in their paper ("Removing the Ambiguity of Phonetic 
Chinese by the Relaxation Technique," Computer Processing of Chinese & Oriental 
Languages, Vol. 3, No. 1, May 1987), Lin and Tsai, attempting to overcome the 
above-described ambiguity problem, propose a method of converting phonetic 
Chinese syllables to character Chinese using the relaxation process widely used in 
image analysis problems, such as edge detection, curve detection and shape 
recognition. More specifically, they employ the relaxation process to obtain the 
optimal path through the lattice of possible characters, making use of the lexical 
probabilities of the characters given the syllables and the transition probabilities of 
adjacent characters and adjacent syllables. 
Summary of the Invention 

I have discovered a simpler, yet equally effective, method for converting 
phonetic Chinese into Chinese characters. In accordance with the principles of my 
invention, this conversion is effected by obtaining the optimal path through the 
lattice of possible Chinese characters by calculating only the probability of adjacent 
Chinese characters appearing in text. 

More specifically, an automated method is disclosed for converting n 
phonetic Chinese syllables S_1 through S_n in a text into n Chinese characters C_1 
through C_n. In accordance with my method, for each Chinese syllable S_i, I generate 
a group of Chinese characters C_i1 through C_iz, collectively referenced as C_i1-z, 
possibly corresponding thereto. Then I compute the optimal path through the lattice 
of possible Chinese characters C_11-z through C_n1-z to derive the most likely n 
Chinese characters C_1 through C_n corresponding to the n phonetic Chinese syllables 
S_1 through S_n. This optimal path is beneficially computed, in accordance with the 
principles of my invention, by deriving the probability of the use of adjacent Chinese 
characters C_i and C_{i-1} in said text based upon the frequency of the ordered 
appearance of said adjacent Chinese characters in a large corpus of Chinese text, 
multiplying together the derived probabilities, and selecting the path with the highest 
probability as the optimal path. 
Brief Description of the Drawing 

Further features and advantages of my invention will become apparent 
from the following detailed description, taken together with the drawing, in which: 

FIG. 1 is a flow diagram of the Chinese syllable to Chinese character 
conversion method according to my invention; 

FIG. 2 is a Chinese character matrix indicating the possible Chinese 
characters (C1-C9) corresponding to each of the particular Chinese syllables 
(S1-S5); 

FIG. 3 discloses the same Chinese character matrix shown in FIG. 2, 
but where, for ease of explanation, the Chinese characters have been replaced by 
English letters "a" through "h"; 

FIG. 4 is a matrix reflecting the number of times representative Chinese 
characters ("a" through "h") appeared adjacent to one another in a large corpus of 
Chinese character text; and 

FIG. 5 shows how the frequency numbers of FIG. 4 are used to compute 
the optimum path through the matrix of FIG. 3 to thereby select the Chinese 
characters most likely corresponding to the particular Chinese syllables. 
Detailed Description 

I will now describe my invention in terms of selecting the optimal path 
through the lattice of Chinese characters generated from a phonetic input. So 
consider the example below in which the phonetic Chinese syllables ("tai2", "wan1", 
"you3", "tai2", "feng1") each refer respectively to one of the Chinese characters 
found beneath it. Thus, "wan1" refers to one of the five Chinese characters beneath 
it, and "you3" refers to one of the six Chinese characters beneath it. 

[lattice of candidate Chinese characters; not legible in this copy] 

The Chinese characters [not legible in this copy] are the sentence in Chinese: 
"Taiwan has typhoons." 

I will now describe the methodology used to select the optimum path in 
the above-described example in accordance with the principles of my invention. 
Step 1 of FIG. 1 identifies the starting point of my method as the Chinese syllables 
S_1 through S_n or, in the illustrative example, S_1 through S_5, corresponding to the 
Chinese syllables "tai2", "wan1", "you3", "tai2", and "feng1" respectively. 

In accordance with step 2 of FIG. 1, Chinese characters C_1 through C_z, 
the possible Chinese characters corresponding to a particular Chinese syllable, are 
generated for each Chinese syllable S_i, where the number of such characters "z" 
may differ for each Chinese syllable. Thus, with reference to FIG. 2, which shows the 
matrix formed by the generation of such Chinese characters, "z" is 9 for "tai2" since, 
as shown, "tai2" represents one of nine possible characters, "z" is 5 for "wan1" since 
"wan1" represents one of five possible characters, "z" is 6 for "you3" since, as shown, 
"you3" represents one of six possible characters, and "z" is 9 for "feng1" since, as 
shown, "feng1" represents one of nine possible characters. 

In accordance with step 3 of FIG. 1, the generated Chinese characters 
are formed into a matrix where the characters generated for the first syllable form the 
left-hand column C_1 and the characters generated for the last syllable (number n) 
form the right-hand column C_n. Thus, with reference to the matrix in FIG. 2, 
Chinese character C_1 for "tai2" (which is really character C_11 in the matrix 
using the numbering methodology of step 3 of FIG. 1) is the first Chinese 
symbol under "tai2", Chinese character C_2 for "tai2" (which is character C_12 in the 
matrix) is the second Chinese symbol under "tai2", Chinese character C_3 for "tai2" 
(which is character C_13 in the matrix) is the third Chinese symbol under "tai2", and 
so on until Chinese character C_9 for "tai2" (which is character C_19 in the matrix), 
the ninth Chinese symbol under "tai2". 

Similarly, Chinese character C_1 for "wan1" is the first Chinese symbol under 
"wan1" and would be designated C_21 in accordance with the numbering scheme of 
FIG. 1 since it is the first symbol for the second character. Similarly, Chinese 
character C_2 for "wan1" is the second Chinese symbol under "wan1" and would be 
designated C_22 in accordance with the numbering scheme of FIG. 1 since it is the 
second symbol for the second character. Finally, the last Chinese character for 
"wan1" is the fifth Chinese symbol under "wan1" and would be designated C_25 in 
accordance with the numbering scheme of FIG. 1 since it is the fifth symbol for the 
second character. 

Chinese character C_1 for the last symbol "feng1" (i.e., S5) is the first 
Chinese symbol under "feng1" and would be designated C_51 in accordance with the 
numbering scheme of FIG. 1 since it is the first symbol for the fifth character. The 
last symbol in the matrix, found at its bottom right corner, is C_9 for "feng1" since it 
is the ninth symbol under "feng1", and is designated C_59 since it is the ninth symbol 
for the fifth character. 
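
The numbering scheme described above can be sketched as follows. This is a hypothetical illustration only: the candidate counts follow FIG. 2 as described in the text, but the labels are placeholder designations (C11, C12, ...) rather than real characters, and `build_lattice` is a name invented here.

```python
# Candidate counts per syllable as described for FIG. 2; the real candidate
# characters are not reproduced here, so placeholder designations are used.
CANDIDATE_COUNTS = {"tai2": 9, "wan1": 5, "you3": 6, "feng1": 9}


def build_lattice(syllables):
    """Return one column of designations C<i><j> per syllable.

    i is the 1-based position of the syllable in the input and
    j is the 1-based position of the candidate under that syllable.
    """
    lattice = []
    for i, syllable in enumerate(syllables, start=1):
        z = CANDIDATE_COUNTS[syllable]  # number of candidates for this syllable
        lattice.append([f"C{i}{j}" for j in range(1, z + 1)])
    return lattice


lattice = build_lattice(["tai2", "wan1", "you3", "tai2", "feng1"])
print(lattice[0][0], lattice[1][4], lattice[4][8])  # C11 C25 C59
```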

FIG. 3 depicts the same Chinese character matrix as that shown in FIG. 
2, but where, for ease of later explanation of the selection of the optimum path 
through this matrix, some of the Chinese symbols have been replaced by the English 
designations "a" through "h". Thus, "a" in FIG. 3 represents the first symbol under 
"tai2" in FIG. 2, "b" in FIG. 3 represents the second symbol under "tai2", "c" in FIG. 
3 represents the first symbol under "wan1" in FIG. 2, and so on. These designations 
are arbitrary (except that identical Chinese characters obviously are identified by the 
same English designation) and serve only to facilitate further explanation of the 
principles of my invention. 

FIG. 4 shows a frequency matrix reflecting the use of adjacent Chinese 
characters (C_{i-1}, C_i) in a large corpus of Chinese text. This matrix was derived by 
analyzing the corpus and noting the number of times ordered pairs of Chinese 
characters appeared in the text. Again, for ease of explanation, the same English 
letters "a" through "h" used in FIG. 3 are used respectively to identify the 
adjacent Chinese characters shown in FIG. 4. The letter "i" refers to the Chinese 
sentence delimiter (Chinese 'period') and, as hereinafter explained, is used to 
delineate the beginning and the end of a phrase of Chinese text. Since the number of 
Chinese characters is quite large, FIG. 4 shows only a representative portion of the 
actual frequency matrix, which is approximately a 6000 by 6000 matrix and 
represents use of 6000 Chinese characters. 

FIG. 4 shows the number of times ordered pairs of Chinese characters 
(C_{i-1}, C_i) appeared in the corpus. For example, the pair "aa" appeared 5 times in 
the Chinese corpus since the number 5 appears for C_{i-1} = a and C_i = a. Similarly, 
the pair "ab" appeared 0 times since the number 0 appears for C_{i-1} = a and C_i = b. 
On the other hand, the pair "ac" appeared 1513 times (see C_{i-1} = a and C_i = c). 
One notices immediately that while the pair "ac" is found quite frequently in 
Chinese text, its inverse, namely "ca", is found only infrequently (e.g., "1" in 
FIG. 4). One notices readily that many pairs were not found at all in the corpus; 
see, for example, "ab", "af", "ah", "ba", "bb", "be" and so on. 

The letter "i" in FIG. 4 refers to the Chinese sentence delimiter and is 
used to pad both ends of an input sequence. Thus, where "a" is the first character, 
the ordered pair (C_{i-1}, C_i) really constitutes the pair "ia". Similarly, where "a" is 
the last character, the ordered pair (C_{i-1}, C_i) really constitutes the pair "ai". FIG. 4, 
therefore, indicates that "a" was the first character 322 times (see C_{i-1} = i, C_i = a) 
and that "a" was the last character only 22 times (see C_{i-1} = a, C_i = i). 
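
A frequency matrix of this kind could be gathered along the following lines. This is a minimal sketch, not the procedure actually used to produce FIG. 4: the toy corpus is invented, sentences are assumed to be already split at the delimiter, and `count_pairs` is a hypothetical helper name.

```python
from collections import Counter

DELIM = "i"  # stands in for the Chinese sentence delimiter, as in FIG. 4


def count_pairs(sentences):
    """Count ordered adjacent pairs, padding each sentence with DELIM."""
    counts = Counter()
    for sentence in sentences:
        padded = [DELIM, *sentence, DELIM]
        # zip pairs each symbol with its successor: (C_{i-1}, C_i)
        counts.update(zip(padded, padded[1:]))
    return counts


# Toy corpus of letter-coded sentences, invented purely for illustration.
toy_counts = count_pairs(["ace", "ac", "ca"])
print(toy_counts[("a", "c")], toy_counts[("c", "a")])  # 2 1
```

Because the pairs are ordered, "ac" and "ca" are counted separately, and a pair that never occurs simply reads back as 0.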

Returning now to FIG. 3, our goal is to find the optimum path through 
the 9 by 5 matrix (comprising C_11 through C_59) using the frequency information of 
FIG. 4 to thereby identify the five Chinese characters most likely corresponding to 
the five Chinese syllables "tai2 wan1 you3 tai2 feng1". More specifically, there are 
21,870 possible paths through the matrix (i.e., 9x5x6x9x9). One path would be that 
shown in the first row in FIG. 3, "aceag". Another path would be that shown in the 
second row, "bdfbh". Of course, by combining the first and second rows, other 
possible paths can be derived easily, e.g., "adfbh", "bdeag", etc. In fact, FIG. 5 
shows the 32 combinations possible by combining just the symbols in the first and 
second rows in FIG. 3. The path "aceag" is listed as the first path in FIG. 5, "aceah" 
is listed as the second path, "acebg" the third ... and "bdfbh" the last path. 
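
The 32 combinations and their ordering can be reproduced with a short enumeration. The sketch below assumes only the letter codes given in the text for the first two rows of FIG. 3; the variable names are invented here.

```python
from itertools import product

# First two rows of FIG. 3, written as one string of letter codes per column.
columns = ["ab", "cd", "ef", "ab", "gh"]

# product() varies the last column fastest, which matches the listing order
# described for FIG. 5.
paths = ["".join(p) for p in product(*columns)]

print(len(paths))                                # 32
print(paths[0], paths[1], paths[2], paths[-1])   # aceag aceah acebg bdfbh
```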

FIG. 5 also shows, to the right of each path, the frequency calculation 
used to derive the optimum path. For example, the frequency calculation (or 
probability calculation) for the first path "aceag" is derived by looking up in FIG. 4 
the individual ordered occurrence frequencies for the following pairs: "ia", "ac", "ce", 
"ea", "ag", "gi", namely 322, 1513, 26, 25, 2, 41. Then these six numbers are 
multiplied together to derive the score for the use of the Chinese characters "aceag", 
resulting in the large number 25,967,013,800 shown in line 1 of FIG. 5. The only 
other probable path is shown on the third line, namely "acebg", and, in fact, 
represents the optimum path because the calculated probability number 
30,121,736,008 is larger than any other calculated number. Thus, the Chinese 
characters "acebg" are selected as representing those characters defined by the 
Chinese syllables "tai2 wan1 you3 tai2 feng1" and, in fact, recite the sentence in 
Chinese: "Taiwan has typhoons." 
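
The score for a single path can be reproduced as follows. This sketch uses only the six pair frequencies quoted in the text; any pair not quoted is treated as 0, which is an assumption made here for illustration, and `path_score` is a hypothetical helper name.

```python
from math import prod

DELIM = "i"  # sentence delimiter used to pad both ends of the path

# Pair frequencies quoted in the text for FIG. 4 (other pairs omitted).
FREQ = {"ia": 322, "ac": 1513, "ce": 26, "ea": 25, "ag": 2, "gi": 41}


def path_score(path, freq):
    """Multiply the frequencies of all adjacent pairs in the padded path."""
    padded = DELIM + path + DELIM
    return prod(freq.get(a + b, 0) for a, b in zip(padded, padded[1:]))


print(path_score("aceag", FREQ))  # 25967013800
```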

In more mathematical terms, I have discovered that the optimal path 
through the matrix represented by the Chinese characters can be derived by reference 
to the following formula, where "c" refers to a Chinese character and p(c_i | c_{i-1}) 
refers to the probability of Chinese character i, given the previous adjacent 
occurrence of Chinese character i-1: 

$$
\arg\max \prod_{i=1}^{n+1} p(c_i \mid c_{i-1})
$$
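
As a worked instance, and assuming the product runs over all adjacent pairs once the sequence is padded with the sentence delimiter at both ends (an assumption consistent with the six factors used for the five-syllable example), the score of the path "aceag" expands as shown below. The symbol \delta is introduced here for the delimiter that FIG. 4 labels "i", to avoid confusion with the index i.

```latex
\prod_{i=1}^{6} p(c_i \mid c_{i-1})
  = p(a \mid \delta)\, p(c \mid a)\, p(e \mid c)\, p(a \mid e)\, p(g \mid a)\, p(\delta \mid g)
  \;\propto\; 322 \times 1513 \times 26 \times 25 \times 2 \times 41
  = 25{,}967{,}013{,}800
```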

Conceptually one enumerates all of the possible paths through the lattice 
and then picks the best path according to the above formula. Of course, truly 
enumerating all possible paths is computationally expensive. However, as indicated 
in the above formula and previous discussion, the score for the entire path is 
computed by multiplying out the probabilities (frequencies) for adjacent pairs of 
characters, and one can therefore significantly reduce the paths that one has to 
consider by using dynamic programming (i.e., Viterbi algorithm) techniques. In 
formal terms, note that for any syllable syl_i, and for each c_{i,j} (a Chinese character 
corresponding to syl_i), one is considering the scores of all paths which end in c_{i,j}. 
Note, however, that we only need to keep the best path ending in c_{i,j}. This is because 
when we move on to the next syllable syl_{i+1} and consider all characters c_{i+1,k}, we 
want to compute the scores for paths ending in the pair of characters c_{i,j}c_{i+1,k} (as 
well as the scores for paths ending in other characters corresponding to the pair 
syl_i syl_{i+1}); we do this computation for each path ending in c_{i,j} by multiplying the 
score for that path by the frequency of occurrence of the pair of characters 
c_{i,j}c_{i+1,k}, freq(c_{i,j}c_{i+1,k}). However, it is clear even before we perform these 
calculations that we really only need to consider the best scoring path ending in c_{i,j}: 
since the same multiplier freq(c_{i,j}c_{i+1,k}) is used to compute the frequency of each 
path ending in the pair c_{i,j}c_{i+1,k}, the best path ending in the pair c_{i,j}c_{i+1,k} will 
simply be the best path ending in c_{i,j} concatenated with the character c_{i+1,k}. We 
therefore discard all but the best scoring path leading up to c_{i,j}. Thus, rather than 
keeping around cc_1 · cc_2 · ... · cc_{i-1} paths (where cc_m is the number of characters 
possibly corresponding to syl_m) ending in c_{i,j}, we only keep one. 

To illustrate, let us return to FIG. 5, which, as we have noted, represents a subset of 
the possible paths through the lattice given in FIG. 3. Suppose that we are 
considering syllable "wan1", in the third column of FIG. 5. Note that there are four 
distinct possible paths illustrated in FIG. 5 which end in possible transliterations of 
"wan1", namely: 

• "iac", the initial subpath of the final paths numbered 1-8; 

• "iad", the initial subpath of the final paths numbered 9-16; 

• "ibc", the initial subpath of the final paths numbered 17-24; 

• "ibd", the initial subpath of the final paths numbered 25-32. 

Following the above formal description, we can eliminate "ibc" because its score 
0x0=0 obviously does not compete with the score of the other path ending in "c", 
namely "iac", whose score at this point is 322x1513=487,186. (Note that in practice 
the value 0 is not actually used, but rather an arbitrarily chosen very small valued 
constant; this is for technical reasons which do not affect the discussion here.) 
Naturally, there is no way that the longer paths "ibce" and "ibcf" can compete with 
"iace" or "iacf" respectively, either; this is because at the point we compute the 
scores for those paths, we are multiplying the scores for the pairs "ce" and "cf" with 
the scores for the paths "ibc" and "iac", and since we already know that "ibc" is an 
inferior candidate to "iac", it follows that "ibce" and "ibcf" are inferior candidates, 
respectively, to "iace" and "iacf". The subpath "ibc" can therefore be eliminated 
from further consideration. Both of the paths ending in "d", "iad" and "ibd", have 0 
scores, and in principle we could eliminate both of these, but in practice one of them, 
in this case "iad", is kept around; it will be eliminated in later steps. At this point, 
then, we have eliminated two paths, "ibc" and "ibd", and retained two, "iac" and 
"iad". Moving on to the syllable "you3" in the fourth column of FIG. 5, we now 
have to consider the following possible paths: 

• "iace", the initial subpath of the final paths numbered 1-4; 

• "iacf", the initial subpath of the final paths numbered 5-8; 

• "iade", the initial subpath of the final paths numbered 9-12; 

• "iadf", the initial subpath of the final paths numbered 13-16. 

Of those paths ending in "e", namely "iace" and "iade", the former has a value of 
322x1513x26=12,666,836 and the latter a value of 322x0x0=0; "iade" is therefore 
eliminated from further consideration. Of those paths ending in "f", namely "iacf" 
and "iadf", both have a value of 0 (322x1513x0 and 322x0x0 respectively); as above, 
one of these, in this case "iacf", is kept around in practice. At this point we have 
kept only two paths, "iace" and "iacf", and have eliminated six other possibilities: 
"iade" and "iadf" were eliminated on this step, and "ibce", "ibcf", "ibde", "ibdf" were 
eliminated on the previous step because of the elimination of "ibc" and "ibd" on that 
step. 
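
The pruning walked through above can be sketched as a small dynamic-programming routine. This is an illustrative sketch under stated assumptions, not the patented implementation: the frequencies for "eb" and "bg" are HYPOTHETICAL values chosen only so that their product is 58 (the factor implied by the stated score of "acebg"), pairs not quoted in the text are treated as zero, and the floor constant is an arbitrary stand-in for the "very small valued constant" mentioned above.

```python
DELIM = "i"    # sentence delimiter padding both ends of the input
FLOOR = 1e-6   # assumed stand-in for the small constant used instead of 0

# Frequencies quoted in the text, plus two HYPOTHETICAL entries ("eb", "bg").
FREQ = {"ia": 322, "ac": 1513, "ce": 26, "ea": 25, "ag": 2, "gi": 41,
        "eb": 29, "bg": 2}


def viterbi(lattice, freq, floor=FLOOR):
    """Return (best_path, score), keeping one best path per candidate."""
    # best maps a character to (score, path) of the best path ending in it.
    best = {DELIM: (1.0, "")}
    # Append a final column holding only the delimiter to close the path.
    for column in list(lattice) + [[DELIM]]:
        new_best = {}
        for cur in column:
            # Extend every surviving path by cur and keep only the best one.
            new_best[cur] = max(
                (score * max(freq.get(prev + cur, 0), floor), path + cur)
                for prev, (score, path) in best.items()
            )
        best = new_best
    score, path = best[DELIM]
    return path[:-1], score  # drop the trailing delimiter


# First two rows of FIG. 3, one string of letter codes per syllable.
best_path, best_score = viterbi(["ab", "cd", "ef", "ab", "gh"], FREQ)
print(best_path)  # acebg
```

With these inputs the routine keeps "ac" and "ad" after the second syllable and "ace" and "acf" after the third, mirroring the eliminations described in the text, while evaluating only a handful of extensions per column instead of all 32 paths.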

The frequency statistics of FIG. 4 used in the illustrative embodiment 
were derived from a corpus of 2.6 million characters of Chinese newspaper text. 
Estimates of the probabilities of the appearance of 2 adjacent Chinese characters can 
be derived by dividing the frequency of occurrence of a sequence by the size of the 
corpus. However, since all probability estimates thus derived represent the 
frequency divided by the corpus size, in practice the corpus size is omitted from the 
calculation since it does not affect the maximization (thus the use of frequency rather 
than estimated probability in the above illustrative description of my method). 
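
The reason the corpus size can be dropped may be spelled out as follows. Here N denotes the corpus size and m the number of adjacent pairs in a path; both symbols are introduced only for this illustration. Every path through the same lattice has the same m, so:

```latex
\arg\max_{\text{paths}} \prod_{k=1}^{m} \frac{\mathrm{freq}(c_{k-1}c_k)}{N}
  = \arg\max_{\text{paths}} \; N^{-m} \prod_{k=1}^{m} \mathrm{freq}(c_{k-1}c_k)
  = \arg\max_{\text{paths}} \prod_{k=1}^{m} \mathrm{freq}(c_{k-1}c_k)
```

The factor N^{-m} is the same positive constant for every candidate path, so it changes the scores but not which path attains the maximum.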

To evaluate the effectiveness of my method, seven short samples of text 
of differing length were chosen, representative of various writing styles ranging from 
very classical to colloquial: 

i) Ad [classical] 

ii) Report [classical] 

iii) Newspaper social commentary taken from the training corpus [semi-classical] 

iv) Essay [more colloquial] 

v) Narrative [more colloquial] 

vi) Short story [colloquial] 

vii) Exposition [colloquial]. 

The performance of my method is given in the table in terms of percentage correct 
by character (hit rate) for each of the styles; the hit rate from my method is specified 
in the third column and should be compared with the hit rate achieved by merely 
picking the most common character given the pronunciation, which is given in the 
second column: 

Style                Lexical Probabilities Only    My Method 

i [class.]                     76%                   93% 
ii [class.]                    73%                   90% 
iii [semi-class.]              76%                   98% 
iv [more coll.]                69%                   73% 
v [more coll.]                 72%                   86% 
vi [coll.]                     71%                   [?]9% 
vii [coll.]                    71%                   92% 

Total                          73%                   90% 


A trend evident in these data is that there is some dependence upon style: my current 
training corpus is heavily classical in style since it is mostly derived from 
newspapers. As a consequence, texts (i-iii), which are classical in style, are better 
rendered than the more colloquial texts, with the exception of (vii). I expect that this 
style dependence would become less marked if the training corpus were expanded to 
include other styles. 
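
The two quantities compared in the table can be sketched as follows. This is a hypothetical illustration: the per-syllable character counts are invented, letter codes stand in for characters as in FIG. 3, and neither function name comes from the original text.

```python
def hit_rate(predicted, reference):
    """Fraction of positions where the predicted character is correct."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def most_common_baseline(syllables, char_counts):
    """Pick each syllable's most frequent character, ignoring context."""
    return "".join(
        max(char_counts[s], key=char_counts[s].get) for s in syllables
    )


# Invented counts of how often each letter-coded character occurs
# for a given pronunciation (illustration only).
CHAR_COUNTS = {
    "tai2": {"a": 10, "b": 3},
    "wan1": {"c": 7, "d": 1},
    "you3": {"e": 9, "f": 2},
    "feng1": {"g": 4, "h": 6},
}

phrase = ["tai2", "wan1", "you3", "tai2", "feng1"]
baseline = most_common_baseline(phrase, CHAR_COUNTS)
print(baseline, hit_rate(baseline, "acebg"))  # aceah 0.6
```

The baseline necessarily maps both occurrences of "tai2" to the same character, which is one way a context-free choice loses accuracy relative to a method that scores adjacent pairs.
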
Claims: 

1. An automated method for converting n phonetic Chinese syllables S_1 
through S_n in a text into n Chinese characters C_1 through C_n comprising the steps 
of: 

generating for each Chinese syllable S_i a group of Chinese characters 
C_i1 through C_iz, collectively referenced as C_i1-z, possibly corresponding thereto, 

computing the optimal path through the lattice of possible Chinese 
characters C_11-z through C_n1-z to derive the most likely n Chinese characters C_1 
through C_n corresponding to the n phonetic Chinese syllables S_1 through S_n, 

SAID METHOD BEING CHARACTERIZED IN THAT: 

said computing step comprising deriving the probability of the use of 
adjacent Chinese characters C_i and C_{i-1} in said text based upon the frequency of 
the ordered appearance of said adjacent Chinese characters in a large corpus of 
Chinese text, multiplying together the derived probabilities, and selecting the path 
with the highest probability as said optimal path. 

2. The method of claim 1 further characterized in that: 

said computing step is effected using dynamic programming techniques. 

3. The method of claim 1 wherein the corpus of Chinese text comprises 
one or more corpora of Chinese texts representative of one or more Chinese 
language styles. 



AMENDMENTS TO THE CLAIMS HAVE BEEN FILED AS FOLLOWS 

4. An automated method of converting one or more phonetic syllables of 
a language into one or more language characters, the method comprising the steps 
of: 

generating for each phonetic syllable a group of one or more language 
characters possibly corresponding thereto to form a lattice of characters; and 

computing an optimal path through the lattice of characters to determine 
the most likely sequence of characters corresponding to the one or more phonetic 
syllables; 

CHARACTERIZED IN THAT: 
the computing step comprises: 

deriving for a path a measure of likelihood of its sequence of language 
characters based upon one or more measures of likelihood of occurrence of 
successive characters of the path appearing in a corpus of language text; and 

selecting as the optimal path the path with the greatest measure of 
likelihood. 

5. The method of claim 4 wherein the step of deriving comprises 
determining a measure of likelihood for a path by combining measures of likelihood 
for successive pairs of characters in the path. 

6. The method of claim 5 wherein measures of likelihood for successive 
pairs of characters are combined by multiplication. 

7. The method of claim 4 wherein the corpus of language text comprises 
one or more corpora of language texts representative of one or more language styles. 

8. The method of claim 4 wherein the language is Chinese. 

9. The method of claim 4 wherein the measure of likelihood is a 
probability. 



10. The method of claim 4 wherein the measure of likelihood is a 
frequency. 